Massive genome reduction predates the divergence of Symbiodiniaceae dinoflagellates

Abstract Dinoflagellates in the family Symbiodiniaceae are taxonomically diverse, predominantly symbiotic lineages that are well-known for their association with corals. The ancestor of these taxa is believed to have been free-living. The establishment of symbiosis (i.e. symbiogenesis) is hypothesized to have occurred multiple times during Symbiodiniaceae evolution, but its impact on genome evolution of these taxa is largely unknown. Among Symbiodiniaceae, the genus Effrenium is a free-living lineage that is phylogenetically positioned between two robustly supported groups of genera within which symbiotic taxa have emerged. The apparent lack of symbiogenesis in Effrenium suggests that the ancestral features of Symbiodiniaceae may have been retained in this lineage. Here, we present de novo assembled genomes (1.2–1.9 Gbp in size) and transcriptome data from three isolates of Effrenium voratum and conduct a comparative analysis that includes 16 Symbiodiniaceae taxa and the other dinoflagellates. Surprisingly, we find that genome reduction, which is often associated with a symbiotic lifestyle, predates the origin of Symbiodiniaceae. The free-living lifestyle distinguishes Effrenium from symbiotic Symbiodiniaceae vis-à-vis their longer introns, more-extensive mRNA editing, fewer (~30%) lineage-specific gene sets, and lower (~10%) level of pseudogenization. These results demonstrate how genome reduction and the adaptation to distinct lifestyles intersect to drive diversification and genome evolution of Symbiodiniaceae.

Introner elements (IEs) are a type of mobile element consisting of inverted and direct repeat motifs found at 5′-and 3′-end of introns in diverse eukaryotes [5,6].Recent research on dinoflagellates revealed that IEs are more abundant in free-living (within 10-12% of genes) than in symbiotic/parasitic species (0.8-6.0% of genes) [7,8].Although not as high as in other free-living dinoflagellates, the Ev genomes exhibit more IE-containing genes (5%) than do the S1 (4%) and S2 genomes (3%) (Fig. 2D).IEs have been postulated to be nonautonomous and their mobility is dependent on transposases encoded in dinoflagellate genomes [8].We recovered transposase protein sequences from most of the Symbiodiniaceae genomes in this study (Supplementary Table 13), suggesting a capacity for IEs to be mobile.

Editing of mRNAs
Editing of mRNAs allows Symbiodiniaceae to increase the variability in protein isoforms [9].
To assess mRNA editing in E. voratum, we focused on RCC1521, which has the most contiguous genome assembly, and compared the rates and types of mRNA editing against a representative from S1 (S. microadriaticum CCMP2467 [9]) and S2 (D. trenchii CCMP2556 [10]).Using a conservative approach to tease apart potential genomic polymorphism and mRNA editing using both transcriptome and genome data [11] (see Supplementary Methods), we identified 45,009 unique mRNA edited sites in E. voratum, about 13-fold and 4-fold greater than in S. microadriaticum and in D. trenchii, respectively (Supplementary Table 17).
A larger proportion of genes in E. voratum (9,158,28.5%)contain mRNA edits, compared to S. microadriaticum (774, 1.6%) and D. trenchii (4,227,7.6%),although the distribution of substitution types is similar among the three species (Supplementary Figs.6A-D).Most edits in D. trenchii and E. voratum (>60%) are located in exons (Supplementary Fig. 6E), whereas in S. microadriaticum edits are evenly split between exons and introns, although this may be due to the more-fragmented genome assembly used in Liew et al. [9].As also observed in S. microadriaticum [9], there is a slight bias for edits to occur near 5¢ ends of genes (within the first ~10% of gene length, P < 0.05; Supplementary Fig. 6F) and the edits tend to be located within 1 Kb of each other (versus the distribution expected at random, P < 0.05; Supplementary Fig. 6G).The high level of mRNA editing in Ev is consistent with data from another free-living dinoflagellate, Pr. cordatum (42,969 edited sites; 32,067 within 10,169 [12%] genes) [7].Together with our observation of fewer Ev-specific protein families than those specific to S1 or S2 (Fig. 2), these results suggest a more pronounced role of mRNA editing in generating functional diversity in free-living versus symbiotic dinoflagellate taxa.
The relationship and impact of this RNA editing on the encoded proteins remain to be investigated using proteomics.

Impact of symbiogenesis on phylogenetic signal in non-coding regions
Our alignment-based phylogenetic trees that were inferred using multiple protein families (Fig. 1A), the standard molecular marker of 18S rRNA gene (Supplementary Fig. 7A), and the ITS2 region (Supplementary Fig. 8A) were consistent with previously established phylogenies [12][13][14].We then analysed the phylogenetic signal of whole-genome sequence data using a k-mer-based alignment-free (AF) approach (see Methods), focusing on distinct genome-sequence regions following Lo et al. [15].The inferred AF phylogenies of introns (Supplementary Fig. 7B), repetitive regions (Supplementary Fig. 8B), repeat-masked wholegenome sequences (Supplementary Fig. 7C), and entire whole-genome sequences (Supplementary Fig. 8C) placed Ev as the basal group, branching earlier than S1/S2.In comparison, the AF phylogenies for coding regions (i.e., coding sequences [CDS; Supplementary Fig. 7D] and protein sequences [Supplementary Fig. 8D]) were largely congruent with the phylogeny of 18S rRNA gene (Supplementary Fig. 7A), placing S1 as earlier branching than Ev.The branching orders of Ev versus S1/S2 in AF trees were supported by a robust jackknife support of ³ 96% based on 100 subsampled replicates (Supplementary Figs.7B-D).This trend is consistent with visualisation of the AF distances as a phylogenomic network (Supplementary Fig. 8E), in which E. voratum is more closely related to P. glacialis based on intron sequences, and to Symbiodinium spp.based on CDS.
The incongruence between phylogenies inferred from coding versus non-coding regions with robust support of the distinct clades clearly indicate differential selective pressure acting on these two regions in Symbiodiniaceae genomes, as demonstrated in an earlier study [15].
This mixture ground using a pre-chilled mortar and pestle to fine powder, swirled with liquid nitrogen, and transferred to a chilled 15mL Falcon tube.Lysis buffer (5mL, 65°C) was added, and an equal volume of chloroform:isoamyl alcohol (24:1 v/v) was mixed in.The mixture was divided into aliquots (2mL) in Eppendorf tubes and centrifuged (10000g, 10 min, 4°C).

Generation of genome data
For short-read sequencing, the libraries for RCC1521 and rt-383 were prepared using the Illumina TruSeq Nano kit with 350 bp targeted inserts following standard protocol, and sequenced on the NovaSeq 6000 platform at Australian Genome Research Facility (Melbourne, Australia).For CCMP421, the genomic DNA library was prepared using the Chromium Genome Reagent Kit v2 Chemistry following the manufacturer's protocol (Step 2 GEM generation and barcoding onwards) and sequenced on the NovaSeq 6000 platform at Florida International University.
For Nanopore sequencing of RCC1521 and rt-383, the ligation kit SQK LSK-109 was used following standard protocol.Each library was sequenced on a MinION flow cell, and rt-383 gDNA was further sequenced using a PromethION flow cell at the University of Queensland Genome Innovation Hub (Brisbane, Australia).The sequence reads were base-called using guppy v4.0.11 of MinKNOW v20.06.18 (minimum read quality filter = 7).
For PacBio sequencing, the gDNA of RCC1521 (unsheared, for continuous long-read [CLR] library) and rt-383 (sheared in 15-20Kb fragments with Pippin Prep [Sage Science] for HiFi library) were used for library preparation using the SMRTbell Express Template Prep Kit 2.0 following the manufacturer's protocol.The RCC1521 and the rt-383 libraries were sequenced on the PacBio Sequel II platform, respectively at the University of Washington PacBio Sequencing Services (Seattle, WA, USA) and at the University of Queensland Sequencing Facility (Brisbane, Australia).CLR reads were acquired using the PacBio BAM2fastx toolkit.
HiFi consensus reads were obtained using the CCS module of SMRT Link pipeline v8.0.

Generation of transcriptome data
Transcriptome data were generated for RCC1521 and rt-383.Illumina RNA-Seq libraries were generated using polyA-selection with the Dynabeads mRNA purification Kit (Thermo

Estimation of genome size from sequencing data
Illumina short reads were used for estimating genome size based on k-mers.The reads from each genome dataset were processed to remove potential adapters (for 10X linked-reads of CCMP421, specifically the first 23 bases) and polyG tails using fastp v0.20.0 [17].Jellyfish v2.3.0 [18] was used to obtain k-mers, independently for each odd k value between 17 and 31, inclusive.Genome sizes were estimated using the k-mer distributions, as the sum of observed k-mers divided by the k-mer coverage corresponding to peak of the distribution (Supplementary Table 9).Ploidy of each genome dataset was assessed using GenomeScope2 [19] based on 21-mer distribution, with better fit observed for haploid model (p=1) in all three E. voratum genomes.Genome size estimations of other dinoflagellates (Supplementary Table 8) were obtained from earlier studies [20][21][22][23].
The resulted genome scaffolds that did not have other predicted genes were considered putative mitochondrial genome sequences.

Ab initio prediction of protein-coding genes
To predict protein-coding genes, we used a workflow customised for dinoflagellates following Chen et al. [29], incorporating protein and transcriptome evidence and multiple predictors (https://github.com/TimothyStephens/Dinoflagellate_Annotation_Workflow).
For transcript-based gene prediction, Iso-Seq transcripts where available (i.e., for RCC1521 and rt-383) were mapped on the corresponding genome assembly using minimap2 v2.18 for which the code was modified to recognise dinoflagellate alternative splice sites, with --secondary=no -ax splice:hq -uf --splice-flank=no.The assembled transcripts from RNA-Seq (both de novo and genome-guided) for each isolate were mapped to the corresponding genome assembly using BLAT [33]; for CCMP421, de novo assembled transcripts from RCC1521 and rt-383, plus the assembly of these reads guided by the CCMP421 genome, were used in this step.The resulting GFF3 files were input into PASA v2.4.1 [34] modified to recognise dinoflagellate alternative splice sites, with options --
Gene models from the five tools (GeneMark-ES, MAKER, PASA, SNAP, AUGUSTUS) were integrated using EVidenceModeler v1.1.1 [40], following Chen et al. [41].Finally, the resulting gene models were refined to correct exon boundaries, identify untranslated regions, and incorporate alternative splice-forms using the Load_Current_Gene_Annotations.dbi and Launch_PASA_pipeline.pl from the PASA pipeline [42,43] iteratively for three rounds to yield the final gene models.

Consistent functional annotation of predicted genes
To ensure comparability of gene functions among different genomes, we annotated functions of all proteins predicted from the 21 Suessiales genomes (Supplementary Table 7) using a consistent approach.Functions of protein sequences were annotated based on BLASTp searches (e-value < 10 -5 , query/subject cover ≥ 50 %) against SwissProt (2022_01).Those that had no hits were searched against TrEMBL (2022_01); the function of the top protein hit was assumed to the putative function of the query protein.Gene Ontology (GO) terms associated with each UniProt identifier were acquired using the UniProtKB ID mapping tool (https://www.uniprot.org/id-mapping;December 2022).

Enrichment of Gene Ontology terms
For the analysis of gene family evolution of Suessiales taxa (Materials and Methods), Gene Ontology (GO) terms enrichment was performed for six comparisons: (a) shared genes in Ev+Po (test set) versus all genes in Ev+Po (background), (b) shared genes in S1+S2 versus all genes in S1 and S2, (c) shared genes in S1+S2+Po versus all genes in S1, S2, and Po, (d) shared genes in S1+S2+Ev+Po versus all genes in the 21 taxa, (e) exclusive genes to S1 versus all S1 genes, and (f) genes exclusive to Ev versus all Ev genes.

Inference of species tree
To reconstruct a species tree of dinoflagellates based on strictly orthologous protein sequences, we incorporated 1,603,073 predicted protein sequences from 33 dinoflagellate taxa, comprising 21 Suessiales taxa (including the three E. voratum isolates) and 12 other taxa external to Suessiales (Supplementary Table 8).These sequences were clustered into homologous sets using OrthoFinder v2.5.4 [45], from which a species tree was estimated from strictly orthologous sets.

Alignment-free phylogenetic inference and core k-mers
Alignment-free (AF) approach was used to infer phylogenetic relationships from (a) wholegenome sequences (WGS) and from distinct genomic regions of (b) repeat-masked WGS, (c) coding sequences (CDS), (d) introns, (e) annotated repeats, and (f) predicted protein sequences.Each of these distinct regions were extracted from assembled genome sequences using gff3_file_to_feature_files.plimplemented in PASA [34].We followed Lo et al. [15] to identify optimal k-mer length (k) for each of these datasets.Briefly, for each dataset, k-mers at varied length k were enumerated using Jellyfish v2.3.0 [18]; for all datasets, odd-numbered k between 13 and 27 were used, except for the repeats dataset (odd-numbered k values between 13 and 51 were used) and protein sequences (odd-numbered k values between 3 and 9 were used).For each dataset except the protein sequences, the optimal k was determined based on the cumulative proportion of unique k-mers and the cumulative proportion of distinct k-mers, at the point when distributions of both proportions reached a plateau (Supplementary Fig. 1); k value determined this way was found to yield the greatest distinguishing power for phylogenetic analysis [46].For protein sequences, we followed Lo et al. [15] to infer AF phylogenies (below) from each k and chose the k with a topology that best matched the alignment-based tree of the 18S rRNA genes [15].The optimal k was identified as 23 for WGS and repeat-masked WGS, 19 for CDS, 21 for introns, 51 for annotated repeats, and 9 for protein sequences.Jellyfish v2.3.0 was used to extract k-mers at the corresponding optimal k length for each dataset; option -C was used to enforce strandspecific directionality for the intron and CDS datasets.
In each pseudo-replicate, 40% of the data, in iteratively subsampled 100-bp regions at random, were deleted using the Python script jackknife.pyfrom which an AF tree was inferred; the R script Jackknife.rwas then used to calculate jackknife support value in percentage, among the pseudo-replicate trees, for each node in the original AF tree.These scripts are available at https://github.com/chanlab-genomics/alignment-free-tools.
To identify core k-mers [15,50] that are shared by all 21 Suessiales genomes used in this study, we used the optimal k = 23 for the WGS dataset.Using the extracted 23-mers from the entire WGS dataset as input, core 23-mers were identified using the bash command comm (-12).BEDtools [26] intersect was used to identify overlaps between the core k-mers and annotated genomic features.

Analysis of mRNA editing
For the analysis of mRNA editing, we focused on E. voratum RCC1521, and the representative genomes for S1 (S. microadriaticum CCMP2467) and S2 (D. trenchii CCMP2556), for which high-quality genome and transcriptome data are available.Editing of mRNAs was identified using JACUSA2 [11], based on observed nucleotide variants in the mapping of transcripts onto the genome, relative to the mapping of genome sequence reads onto the genome.For each genome, the gDNA reads were mapped on the genome assembly using BWA-mem v0.7.17-r1198 [51] at default setting.RNA-Seq reads were then mapped using HISAT2 v2.2.0 [52] (--rna-strandness RF) against the genome assembly, with a HGFM HISAT2 index (hisat2-build --exon --ss) informed by the annotated splice sites.To generate this index, the gene annotation file in GFF3 was converted to the GTF format using Gffread [53], and splice site and exon locations extracted with the hisat2_extract_splice_sites.py and hisat2_extract_exons.py scripts.Iso-Seq reads, where available, were mapped using minimap2 v2.18 [54] for which the code was modified to recognise alternative splice sites of dinoflagellates, with options --splice-flank=no -secondary=no -ax splice:hq -uf --junc-bed.
Duplicate mappings were removed from the gDNA BAM files using Picard MarkDuplicates (ASSUME_SORTED=true REMOVE_DUPLICATES=true CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT).The MD field documenting mismatched and deleted bases was added to the gDNA BAM files using samtools calmd (-b), as required as input for JACUSA2.JACUSA2 analysis was performed on the gDNA, RNA-Seq, and Iso-Seq BAM files using option -a D,Y,H to remove false positives caused by read starts/ends, indels, splice sites, and homopolymers.Due to the different strand directionality, -P2 RF-FIRSTSTRAND was specified for the runs incorporating RNA-Seq data, and -P2 FR-SECONDSTRAND for the runs using Iso-Seq data.The overlap between RNA edited sites and gene models (and isoforms) was identified using BEDtools [26] intersect at -s -wo.
We followed Liew et al. [9] to assess 5′ bias in the location of RNA editing and the propensity for edits to occur together.We calculated the frequency of the locations of each edit with respect to the features they were in (exon, intron, gene), normalised by the length of each feature.For each edit, its distance (in bp) to the closest upstream edit, and that to the closest downstream edit where all within the same gene, were determined.The average of these two values was used as the per-edit observed distance to neighbouring edits.

Analysis of introner elements
To identify introner elements (IEs), we used Pattern Locator [55] following Farhat et al. [8]: inverted repeats of 8-20 nucleotides within 30 bases of the 5′ and 3′ ends of each intron, flanked by direct repeats of 3-5 nucleotides.We used Seqkit to obtain the first and last 30 bases at each end of introns, then Pattern Locator to identify the IEs.

Analysis of lineage-specific protein sets
To assess if our observation of lineage-specific protein sets was caused by the unbalanced representation of protein datasets among the three groups (i.e., 9 taxa in S1, 7 in S2, and 3 in Ev), we examined the number of lineage-specific protein sets across 100 independent tests.
For each test, among the 811,661 protein sequences from the 21 Suessiales datasets, three datasets from S1 and three from S2 were randomly sampled and included, to achieve the balanced representation of 3 datasets per group; the outgroup Po remains as 2 since we only have data from two isolates of Polarella glacialis.The protein sequences from these 11-taxon datasets were then clustered into homologous sets using OrthoFinder v2.5.4 [45] at default setting.Of the 100 tests, majority yielded more S1-specific sets than Ev-specific sets (64 cases), and more S2-specific sets than Ev-specific sets (69), whereas in 43 cases, both S1specific and S2-specific sets are greater than Ev-specific sets (Supplementary Table 15).
These numbers, while do not meet the usual expectation of a statistical significance (e.g.>90%), provide an insight into how sampling bias of Symbiodiniaceae genomes could affect our results.On the other hand, we observed greater number of protein sets shared by Po with S1 and S2 (i.e.S1+S2+Po) than with Ev (i.e.Ev+Po) in 99 of the 100 cases, indicating statistical robustness of this trend based on the datasets analysed.Based on these results, we caution that the number of lineage-specific protein sets need to be interpreted in the appropriate context, and not taken at face value.Protein sets identified this way, based on sequence similarity using OrthoFinder, represent a proxy for studying evolution of protein or gene families.Given that Ev consists only sequences from one species, whereas those in S1 and S2 comprise datasets of different species and genera, greater divergence of protein sequences in these groups compared to Ev is reasonable, as reflected in our results.

Fisher
) and the Illumina Stranded mRNA Prep following standard protocols.Sequencing was performed on the Illumina NovaSeq 6000 platform at the Australian Genome Research Facility (Melbourne, Australia).PacBio Iso-Seq libraries were prepared using the NEBNext® Single Cell/Low Input cDNA Synthesis and Amplification Module (New England BioLabs) and the SMRTbell Express Template Prep Kit 2.0 following standard protocol and sequenced on the PacBio Sequel II at the University of Queensland Sequencing Facility (Brisbane, Australia).